Abstract — Heterogeneous processing (HP) is a technique intended for Grand Challenge and other high
performance computing (HPC) problems and for bridging the gap between the theoretical potential of
parallel processing and the current reality. In HP, we aim to match code and algorithms to best-suited
architectures, through techniques such as code profiling and analytic benchmarking. Associative
computing (AsC) combines ideas from both associative memories and single instruction multiple data (SIMD)
computers to look at new ways to use fine-grain parallel processors to achieve results beyond what is
normally done by using spinoffs from sequential or multiple instruction multiple data (MIMD) processors.
Heterogeneous associative processing (HAsP) is a generalization of the concepts of both HP and AsC.
In HAsP, the AsC assumption of linking each datum with its own processor is generalized to assuming
that each data file has its own dedicated computer. This paradigm maps onto all levels of granularity and
can be easily emulated on most machines. The goal of HAsP is to allow the user to discuss the
heterogeneous system at the highest possible level and with the tightest possible synchronism. HAsP offers
the potential of combining the simplifying programming approaches and algorithmic efficiencies of AsC
with the performance of HP.


1.  INTRODUCTION

Distributed heterogeneous high performance computing
(DH-HPC) is the “tuned” use of heterogeneous
suites of sequential and parallel HPC processors
to obtain cost-effective HPC and/or metacomputing
performance. The essence of DH-HPC is the ability
to obtain maximal execution by mapping the
computational tasks onto the best-suited architecture.
The intent is that for problems with diverse
computational subtasks, the overall performance
cost-effectiveness will be better than any comparable
single processor. In addition we aim to reduce
the applications programming effort since
well-matched code leads to natural implementations. The
beginning forms of DH-HPC were cases in which
codes were profiled and run on the best-suited
machine with the data following along. As discussed
below, however, we believe that the ultimate
expression of this paradigm is reached in a DH-HPC
environment in which the data are profiled and
remain resident on the best-suited machine and
different instruction streams are passed to it. In other
words, HAsP is an HPC form of MISD processing.
The long-term objective is to develop a methodology
suitable for a heterogeneous suite of supercomputers.
The methodology should be effective, expressive,
extensible and efficient. By effective, we mean easy
to use and applicable to all types of architecture.
By expressive we mean it is usable for all types
of problems. Thus, while DH-HPC problems are
our primary target, our goal is to also evaluate
other compute intensive problems such as dynamic
data bases. Extensible means that the methodology
must be flexible enough to accommodate new
machines and architectures as they are added to the
system. To be efficient the methodology must support
the existence and addition of high level operators
such as sum, convolution, matrix multiply, etc. The
short-term objectives for HAsP are the application of
DH-HPC and associative computing principles to
develop heterogeneous processor suites spanning
wide problem sets and to develop new methods for
benchmarking, code and data profiling, and the
intelligent management of selected DH-HPC suites.
A major short-term objective is the extension of
DH-HPC paradigms and associative computing
principles to heterogeneous suites forming a virtual
associative computer.


2.  DISTRIBUTED  HETEROGENEOUS  HPC 

Shared memory and message passing are two basic
paradigms of computation derived from conventional
multi-user concepts that could be used for a
heterogeneous supercomputer system. Linda is a shared
memory paradigm based on an associative memory
concept. Data (tuples or records) to be processed
are “put” into a shared memory and idle processors
“get” tuples from the memory to work on. Linda
assumes equal power processors which treat data like
resources.
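
The put/get cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `TupleSpace`, the use of `None` as a wildcard field and the non-blocking `get` are choices made for this sketch, not Linda's actual interface.

```python
from typing import Any, List, Optional, Tuple


class TupleSpace:
    """Toy shared tuple space (hypothetical; not Linda's real API)."""

    def __init__(self) -> None:
        self._tuples: List[Tuple[Any, ...]] = []

    def put(self, tup: Tuple[Any, ...]) -> None:
        """Place a tuple into the shared memory."""
        self._tuples.append(tup)

    def get(self, pattern: Tuple[Any, ...]) -> Optional[Tuple[Any, ...]]:
        """Remove and return the first tuple matching the pattern.

        A None field in the pattern matches anything. Returns None when
        nothing matches (this sketch does not block).
        """
        for i, tup in enumerate(self._tuples):
            if len(tup) == len(pattern) and all(
                p is None or p == f for p, f in zip(pattern, tup)
            ):
                return self._tuples.pop(i)
        return None


space = TupleSpace()
space.put(("task", 1, "convolve"))
space.put(("task", 2, "sort"))

# An idle processor asks for any waiting task.
first = space.get(("task", None, None))
```

Selection here is by content rather than by address, which is the associative memory idea the paragraph refers to.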

Actors is a message passing, fine grain, object
oriented approach for concurrent computing. It uses
three primitives: create, send, and become. Where the
get and put primitives of Linda generalize the shared
memory model, the create and send primitives of
Actors generalize the process creation and message
passing concepts. Actors allows you to reason about
the system; however, reasoning is done at the “actor”,
i.e. low level message passing, level.

BBN’s TCL (tool for large grained concurrency)
uses high level linguistic constructs and has a virtual
machine concept for organizing parallelism. The
compiler divides the program into continuations
which are parceled out by the scheduler for execution
on the most appropriate machine. Implicitly, this is
a message passing or object oriented approach,
i.e. the data and command must be sent. The virtual
machine primitives range from host language (LISP)
primitives to high level commands. TCL maintains
a local view. If there are multiple users, each has
their own TCL scheduler. Thus each works most
efficiently if it has application specific information
and can concurrently interrogate processors for
their current status.

Both the Linda and Actors models were designed
for and work well on single MIMD computers, but
when the concepts are moved to a heterogeneous,
distributed network environment, considerable
overhead may be incurred. Linda is essentially a data
driven approach controlled by the instances of
data tuples in shared memory. By definition, this
approach requires a considerable amount of data
sharing. When implemented (as intended) on a shared
memory machine, sharing requires only memory
reads and writes. However, if the paradigm is moved
to a distributed heterogeneous system, the data
must be physically moved, resulting in considerable
overhead. Actors, being a message passing model,
assumes (guarantees) that messages are delivered;
therefore it is possible (probable) that large portions
of system resources (both hardware and software)
are devoted to message passing (i.e. buffering and
forwarding data and commands), not computing.

A general problem of most conventional
approaches to heterogeneous computing is that they
are bottom up. That is, they provide relatively
few low level primitives that support a specific
MIMD paradigm (Linda shared memory, Actors
message passing). These primitives are intended to be
embedded in conventional sequential languages (e.g.
FORTRAN, C, etc.) or form the foundation of a
specifically designed “high level language” and to be
executed on a single computer.

The bottom up approach imposes several
restrictions on the paradigms. For example, in Linda, tuples
are singular, i.e. they represent a single data object or
record. By putting pointers in a tuple, arrays and
structures can be referenced, but the basic mode of
operation is that the normal case is scalar; the
exception is parallel and requires more complicated
syntax. A more general, more powerful system is
achieved if the primitives are composed of entities
which include both parallel and scalar as equivalent
cases.

In Actors the recommended approach to execute a
loop in parallel on a MIMD machine is to break
it into individual messages, one for each iteration of
the loop. This approach should not be extended to a
heterogeneous environment, since it adds considerable
overhead to execution and ignores the
communication efficiency and natural parallelism of SIMD,
vector and other “tightly coupled” architectures.
There should be a minimum of communication
between computers to determine what to do next. A
high level command which encompasses both parallel
(i.e. loops) and scalar operations is needed.

Efficient system management demands that the OS
communications be at the highest possible level but
computation be at the lowest level of parallelism.
For example, functions/operations should take files as
arguments instead of records or tuples, but vector
operations should be done at the machine level, not at
the system level.

Most distributed operating systems are extensions
of MIMD paradigms. As a result, there is no natural
way to match the proper computer to the
computation. For example, in Distributed Linda, both a
SIMD and a MIMD could issue a “get” command
for the same tuple. The race is resolved
arbitrarily, not on the basis of which computer is
better suited for the task. The basic job could be
programmed so that the “best” architecture issues the
“get”; but then the “next best” architecture may be
sitting idle while the “best” one is doing all of the
work.

Associative computing is a programming
paradigm developed explicitly for massively parallel
SIMD computers. It uses the concept that each
datum has its own dedicated processor. Thus, in
associative computing, commands are broadcast to
all components. Those components which recognize
that the commands are applicable to them (using
associative techniques) execute them. The associative
computing paradigm can be combined with DINS to
form an efficient comprehensive DH-HPC system.

3.  HETEROGENEOUS ASSOCIATIVE PROCESSING (HAsP)

The basic assumption of associative computing
is that each datum has its own dedicated processor.
In the HAsP environment, this assumption is
generalized to assuming that each data file has its own
dedicated (possibly parallel) computer. The
distinction between an associative memory and an
associative computer cannot be over-emphasized. Associative
memories require that the selected data be moved to
a central processor before they can be processed.
An associative computer has a separate processor
for every datum wherever it is located and is thus
an inherently distributed system. The associative
computing paradigm maps well onto all levels of
parallelism from low level massively parallel SIMD
computers to high level heterogeneous parallelism
and can be easily emulated on most machines, both
sequential and parallel.

The goal of HAsP is to develop an environment
that allows the user to discuss the heterogeneous
system at the highest possible level and with the
tightest possible synchronism. That is, the primitives
should encompass parallelism and heterogeneous
computation as the norm and treat “sequential von
Neumann computation” as a special case.

The HAsP paradigm makes several assumptions
for large heterogeneous computing networks:

(1)  there  is  much  more  data  than  code; 

(2)  data sets typically have one or a few (related)
natural organizations that are basic to the various
instruction streams that may be directed against
them;

(3)  the cost of communication is high compared
with the cost of computation (or equivalently, the
speed of transmission is slow compared with the
speed of computation); and

(4)  any  system  can  be  viewed  as  a  virtual  associat¬ 
ive  computer  with  a  (small)  common  set  of  com¬ 
mands  and  a  multiplicity  of  data  tuples  each  with 
their  own  processor. 

The  conclusion  is  that  the  data  should  be  sent  to 
the  most  appropriate  computer  initially  and  the  code 
sent  to  the  data. 

In our opinion HAsP can be viewed as a MISD
paradigm, i.e. one in which various instruction
streams are sent to any given machine (or CPU)
which holds those data sets best suited for it.

A layered operating system, a virtual machine
organization, automated data conversion, code and
data profiling, and a layered metacompiler are all
important concepts for HAsP. At first glance, HAsP
is similar to message passing systems; however, there
are five significant differences:

(1)  In HAsP, commands are broadcast to the
entire system, not to specific nodes. Nodes select
commands based on their data content, not address.
A command consists of (i) an action and (ii) argument
patterns delineated by keywords.

(2)  Data  movement  is  minimized.  Commands 
rarely  contain  data.  However,  commands  to  move 
data  may  be  sent  and  as  a  special  case,  a  command 
may  cause  a  node  to  send  a  reply. 

(3)  When a command is received the argument
patterns are used to search the local data base for
items that match. In a multi-user environment,
part of the pattern may include user id and/or job
numbers. Matched items are flagged. The flags
are attached to the appropriate keywords and control
is passed to the action software. Pattern matching
is performed in a mode appropriate for the local
computer. On a SIMD machine, the pattern matching
is done in parallel. On a MIMD machine,
it “could” be done using the message passing or the
shared memory paradigm. On a sequential machine
it would be done sequentially on sorted tuples or
by hashing.

(4)  It is possible for two or more commands to be
issued which match the same data items. In these
cases a “state” item would be included to make tuples
unique. At the end of an action cycle the state would
be updated.

(5)  “Load balancing” is done dynamically on
initial data load. That is, the HAsP heterogeneous
compiler using static code profiling would consult the
concordance to determine the best set of computers
to use. At run time, dynamic data profiling would be
used to refine the decision. Then the data would
be input directly to the appropriate computer. Unlike
OOP approaches that must (at least conceptually)
move data and code from node to node, HAsP
emphasizes the movement of code alone. Once the
data have been input to a computer, they are rarely
moved or copied.
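
A minimal Python sketch may help make the first three differences concrete: a command is broadcast to every node, each node matches the argument patterns against its own local data, flags the matching items, and only then runs the action. All names here (`Node`, `broadcast`, the record fields) are hypothetical, and a plain sequential scan stands in for whatever matching mode the local machine would really use.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]


class Node:
    """A computer holding its own local data base (hypothetical sketch)."""

    def __init__(self, name: str, records: List[Record]) -> None:
        self.name = name
        self.records = records

    def receive(self, action: Callable[[List[Record]], object],
                pattern: Record) -> Optional[object]:
        # Flag local records whose fields match every keyword of the pattern.
        flagged = [r for r in self.records
                   if all(r.get(k) == v for k, v in pattern.items())]
        if not flagged:
            return None  # the command does not apply to this node's data
        return action(flagged)  # control passes to the action software


def broadcast(nodes: List[Node], action, pattern: Record) -> Dict[str, object]:
    """Send one command to every node; collect replies from responders."""
    replies = {}
    for node in nodes:
        result = node.receive(action, pattern)
        if result is not None:
            replies[node.name] = result
    return replies


nodes = [
    Node("simd", [{"user": "u1", "kind": "image", "size": 4},
                  {"user": "u2", "kind": "image", "size": 9}]),
    Node("vector", [{"user": "u1", "kind": "matrix", "size": 16}]),
]


def total_size(records: List[Record]) -> int:
    return sum(r["size"] for r in records)


# The command carries an action and a pattern, but no data and no address.
replies = broadcast(nodes, total_size, {"user": "u1", "kind": "image"})
```

Only the small reply travels back; the records themselves stay on the node that holds them.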

3.1.  Virtual heterogeneous associative machine

In HAsP, a layered view of heterogeneous
processing is advocated. Each layer consists of a virtual
heterogeneous associative machine (VHAM). Thus
there is a VHAM for large area (nationwide)
networks, a VHAM for each of the region-wide, statewide,
building-wide and local area networks, and a VHAM for
a single heterogeneous computer. Not all HAsP
systems need have all levels; indeed three or four
levels of VHAMs would probably be most common.
Where the code/data is known to be mutually
exclusive, multiple associative commands can be issued.
At the topmost level, the HAsP commands
are of a coarse enough granularity that a single
physical channel or bus could be divided into several
time-shared concurrent channels.

VHAM consists of three parts. First, it is a set of
“instructions” which defines a virtual heterogeneous
associative machine. Second, it contains an execution
engine which processes HAsL instructions. Third, it
is a system of protocols whereby new user defined
“instructions” can be added to the system. Thus
VHAM is a paradigm with a predefined minimal
run time expandable instruction set and execution
protocol. In conventional machines, instructions are
delivered to a CPU and they are executed without
question. In the VHAM, instructions are broadcast
to all of the cells, but each cell must determine
whether to execute the instruction. This
determination is performed as follows: upon receipt of an
instruction, a node “unifies” it with its local
instruction set of library calls and extended instructions and
datafiles. If there is a match, the appropriate routine
is called. The node will perform format conversion if
necessary. The called “instruction” may in turn issue
VHAM instructions. Thus control is distributed even
though every level of the HAsC has a designated
control node. That is, a “program” starts by issuing
a command from the designated control node. If a
receiving node receives a command that is in effect a
subroutine call, it may become a transporter node.
It may first perform some local computations and
then start issuing (broadcasting) commands of its
own. If the node happens to be a port node, the
commands are issued to its subset as well as to its
own network. Thus it is possible (even probable)
that multiple instruction streams will be broadcast
simultaneously.
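
The dispatch rule just described can be sketched as follows. This is a simplified illustration, not the VHAM protocol itself: the names `VhamNode`, `Network`, `report` and `total` are invented for the example, "unification" is reduced to checking that both the routine and the named data file are local, and format conversion is omitted.

```python
class VhamNode:
    """One cell of a hypothetical VHAM level."""

    def __init__(self, name, library, datafiles):
        self.name = name
        self.library = library      # instruction name -> local routine
        self.datafiles = datafiles  # file name -> locally held data

    def receive(self, network, instr, filename):
        # "Unify": execute only if the routine and the data are both local.
        routine = self.library.get(instr)
        if routine is None or filename not in self.datafiles:
            return
        routine(network, self, filename)


class Network:
    def __init__(self):
        self.nodes = []
        self.log = []

    def broadcast(self, instr, filename):
        # Every node sees every instruction and decides for itself.
        for node in self.nodes:
            node.receive(self, instr, filename)


def total(network, node, filename):
    network.log.append((node.name, "total", sum(node.datafiles[filename])))


def report(network, node, filename):
    # A called instruction may itself issue further VHAM instructions.
    network.log.append((node.name, "report", filename))
    network.broadcast("total", filename)


net = Network()
net.nodes = [
    VhamNode("a", {"report": report, "total": total}, {"f1": [1, 2, 3]}),
    VhamNode("b", {"total": total}, {"f1": [10], "f2": [5]}),
]
net.broadcast("report", "f1")
```

Node `a` accepts `report` and then becomes an issuer of commands in its own right, which is the sense in which control is distributed.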

3.2.  A  virtual  associative  computer 

The virtual machine organization should be as
closely coupled as possible. The main question is:
what are the virtual commands? The commands
should be very high level (e.g. convolve, fit, Gauss)
consisting of entire programs or algorithms. The
same algorithm may be in two or three different
forms, one for each type of machine in the system.
A concordance records all of the forms and related
parameters. For example, convolve may have four
different morphs, one for each machine type or
subtype (i.e. hypercube SIMD, grid SIMD) or
computation model (message passing or shared memory).
Each concordance entry states its parameters, speed,
data format, etc.
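
One way to picture a concordance is as a table of morphs that can be queried for the best available form of a command. The sketch below is an assumption about its shape: the `Morph` fields, the `best_morph` helper and every speed figure are invented placeholders, not values from any real system.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Morph:
    command: str
    machine_type: str
    ops_per_second: float  # placeholder speed parameter (invented)
    data_format: str


# Four hypothetical morphs of the same high level command.
concordance = [
    Morph("convolve", "hypercube SIMD", 2.0e8, "bit-serial"),
    Morph("convolve", "grid SIMD", 3.0e8, "bit-serial"),
    Morph("convolve", "message passing", 1.0e8, "word-serial"),
    Morph("convolve", "shared memory", 1.5e8, "word-serial"),
]


def best_morph(command: str, available: Set[str]) -> Optional[Morph]:
    """Pick the fastest recorded morph among the machine types on hand."""
    candidates = [m for m in concordance
                  if m.command == command and m.machine_type in available]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.ops_per_second)


choice = best_morph("convolve", {"hypercube SIMD", "shared memory"})
```

A real entry would carry more than a single speed figure, since the best morph generally depends on the size and layout of the data.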

Ideally, the layered OS language would be the same
as the virtual machine and concordant application
languages. The commands from different OS levels
may overlap. They would range from basic to
complex depending on the level of the OS. Although
the OS and application languages would share the
same vocabulary, the OS language is real time and
interactive. The application language is “compiled”
and executed as a background job. There would be a
“macro” mode of operation where the user would
enter a job interactively via the OS language.
Once the correct results were obtained, the history of
the OS commands could be saved in a file, edited if
necessary, and compiled and executed as a program.

3.3.  An  associative  operating  system 

The associative operating system would consist
of one layer per level, that is, one per virtual
associative computer.
In order to develop a layered OS, all aspects of a
conventional operating system need to be
separated, analyzed and then reorganized. For example,
currently in a heterogeneous system, a conventional
multi-user operating system is at the bottom layer
of the system. All operating systems have a data
move function from one disk file to another. In a
heterogeneous system however, this function needs
to be generalized to include moving files from one
file server to another and should be put at a higher
level in the hierarchy. That is, general file moving is
not a primitive but a high level function.
Conventional localized file moving is of course a primitive
function and is the degenerate case of general file
moving.

3.4.  Automatic data conversion

The most important aspect of this approach is
to minimize the amount of data movement in the
system. However, when data is moved, it will be
automatically converted from one format to another.
For example, if data is moved from a word serial
MIMD or vector machine environment to a bit serial
computer, the data must be “corner turned”. Data
reformatting would be handled automatically by the
OS and application languages just as float to fixed
conversion is handled in conventional languages.
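
Corner turning can be illustrated with a short Python sketch. It assumes unsigned integer words of a fixed width and represents the bit serial layout as a list of bit planes, where plane k holds bit k of every word; the function names are chosen for this example only.

```python
from typing import List


def corner_turn(words: List[int], width: int) -> List[List[int]]:
    """Word serial -> bit serial: plane k holds bit k of every word."""
    return [[(w >> k) & 1 for w in words] for k in range(width)]


def corner_unturn(planes: List[List[int]]) -> List[int]:
    """Bit serial -> word serial: rebuild each word from its bits."""
    count = len(planes[0]) if planes else 0
    return [sum(planes[k][i] << k for k in range(len(planes)))
            for i in range(count)]


words = [5, 3, 6]               # binary 101, 011, 110
planes = corner_turn(words, 3)  # least significant plane first
restored = corner_unturn(planes)
```

The operation is a transpose of the bit matrix, so applying it in both directions returns the original words.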

3.5.  Metacompilers 

The concept of developing a metacompiler for a
heterogeneous group of computers is very
enchanting, but very difficult. The efforts from the area
of vectorizing compilers might be a first step, but
they emphasize taking code designed for one
class of machines and transforming it to execute on
another closely related class. Furthermore, current
compilers make no effort to determine the
“best” machine for execution, and the conventional
analysis techniques for converting code to flow
graphs are slow and may not be the most effective
approach. That is, with the current technology, it
should be possible to analyze a code and distribute
it among a suite of machines, but a sequential
algorithm cannot be analyzed and replaced by a new
parallel algorithm.

The automatic detection of parallelism is basically
limited to nested loops in the initial code. For
example in Linpack, vectorization can be used to
optimize the innermost loops, but searching for
idioms such as finding the maximum value of a
vector and replacing it with a (SIMD) parallel maxval
function is much more complex because of the variety
of ways in which the function can be expressed.
Current technology calls for the use of “patterns”.
A different pattern must be used for each possible
realization of the function. This is an ad hoc approach
and is not a suitable solution. The traditional analysis
techniques may not be applicable. For example,
traditional data flow analysis provides reaching
definitions, available expressions and loop
optimization information. This information may be very
useful for planning the top level virtual machine
organization; however, in a data parallel language
this information is not normally useful. That is,
in a conventional sequential language, flow of
control is based on the relationship between scalar
variables. However, in a (data) parallel language
control is determined locally by datum specific logic.
Indeed, this kind of control is equivalent to using
arrays of variables in a sequential computer. As
always, pointers and arrays create situations which
are very difficult to handle using traditional analysis
techniques.

4.  CODE EXECUTION MODELLING

In a heterogeneous supercomputer environment, it
is imperative to dynamically assign jobs to computers
in such a way as to optimize either throughput or
execution speed (or both, ideally). This includes
dividing jobs into subtasks which execute optimally
on VHAMs at all levels. It is not sufficient to assign
tasks on a first come first served basis or some simple
priority scheme. Optimal results can only be achieved
by code execution modeling. Code execution
modeling includes the ability to accurately predict how a
code data set combination will execute on a VHAM.
It incorporates components of benchmarking, code
profiling and data profiling. This section provides
background on these topics and describes a prototype
system.

4.1.  Benchmarking 

Benchmarks are commonly used to test and
evaluate codes, algorithms and machines, and have long
been used, especially for HPC. Nevertheless there are
fundamental differences in the underlying uses of
benchmarks that often lead to semantic
misinterpretation and confusion. For example, scientific users
of HPC often want results from benchmarks as an
indicator of how their existing code will run on new
machines. Designers of new algorithms or machines
often want to know the future potential, including
particularly the result of radical redesign of code
and algorithm. The term “peak performance” has
often been reserved for this last concept, even though
sustained code performance seldom comes close
(although intelligent assembly language coding can
sometimes lead to sustained performances several
times faster than “peak”). One of the more interesting
recent approaches was that of Gustafson et al.,
in which they proposed a scalable methodology
(SLALOM) in which the amount of work done in
a fixed time is the key measure of performance,
rather than the amount of time to do fixed work.
Furthermore the SLALOM approach emphasizes
the need to solve the problem, not run a particular
code (which has been written in a style inherently
favoring one type of architecture). We propose some
refinement of terminology in order to expand on the
differing levels of benchmarking by examining several
situations:

(a)  Consider the case of large physical simulation
code, e.g. climate modeling. It may be impractical to
make radical changes in the code or algorithms in the
near or intermediate future; benchmarks are needed.

(b)  Let “Benchmark Set” be reserved for codes
with little or no “tuning”, e.g. the LINPACK test set.

(c)  Let “Predictor Set” be reserved for codes in
which significant rewriting of code, including
assembly language, is permitted, e.g. the PERFECT Club
suite. In the case of new vendor products, this would
offer the vendors the challenge (and opportunity) to
do the best they can on real problems.

(d)  Let “Subproblem Set” be reserved for cases in
which we change the algorithm, e.g. moving from one
kind of sort to another, as might happen in optimally
moving from one type of architecture to another.

(e)  Let “Problem Set” be reserved for cases in
which the whole approach might be changed, e.g.
Potter has clearly demonstrated that in moving
from von Neumann machines to SIMD machines,
searching, in an associative computing environment,
can be more effective than traditional sorting. The
specification for a Problem Set might merely consist
of work problems that need to be computed in any
manner.

A distinction between benchmarking and code
profiling also needs to be made. Benchmarking is the
process of establishing a suite of codes to model a
“typical” workload so that different architectures
and machines can be compared. However, most
benchmarks have been developed for a traditional
sequential machine environment. Vector machines
have been developed to optimize code written for this
type of environment. On the other hand SIMDs
were developed as an independent architecture.
In a heterogeneous HPC system, a more rigorous
general purpose approach for comparing computers
is needed so that computing resources can be
assigned dynamically. Freund and Peterson have
proposed a formulation for determining the best task
assignments in a DH-HPC environment. Dynamic
assignment requires that the performance of the
currently available computers on waiting jobs can
be predicted in such a way that they can be
meaningfully compared so that an optimal assignment can
be achieved.
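
To illustrate why comparable predictions matter, the following Python sketch assigns waiting jobs to machines from a table of predicted run times. It is a brute force search that minimizes the finishing time of the busiest machine; it is not the Freund and Peterson formulation, and the job times are invented for the example.

```python
from itertools import product
from typing import List, Tuple


def best_assignment(times: List[List[float]]) -> Tuple[Tuple[int, ...], float]:
    """times[j][m] is the predicted run time of job j on machine m.

    Tries every job-to-machine mapping (feasible only for small cases)
    and returns the mapping with the smallest overall finishing time.
    """
    n_machines = len(times[0])
    best, best_span = (), float("inf")
    for choice in product(range(n_machines), repeat=len(times)):
        load = [0.0] * n_machines
        for job, machine in enumerate(choice):
            load[machine] += times[job][machine]
        span = max(load)
        if span < best_span:
            best, best_span = choice, span
    return best, best_span


predicted = [
    [4.0, 1.0],  # job 0 is much faster on machine 1
    [2.0, 3.0],  # job 1
    [2.0, 5.0],  # job 2
]
assignment, makespan = best_assignment(predicted)
```

In this example sending every job to its individually fastest machine gives the same finishing time as the search, but in general that is not guaranteed, which is why the predictions for all machines have to be compared together.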

Code profiling is the technique of analyzing
programs to determine how they may be optimized for
execution on any given VHAM. In a heterogeneous
supercomputer environment, code profiling can be
combined with benchmarking to accomplish code
execution modeling.

This section on code execution modeling, defined
above, is divided into two subsections: throughput
prediction and data profiling. The first proposes an
approach for code profiling including a set of atomic
commonly used parallel operations which can be
easily benchmarked and then combined into more
complex formulations to not only predict the time of
execution for a piece of code, but to also provide an
overall estimate of throughput for an entire DH-HPC
system. An important aspect of this work is the
ability to predict future performance; and while the
approach described below can be laborious, it
is intended that the modeling be automated using
techniques developed for conventional vectorizing
compilers.

4.1.1.  Throughput prediction. This paper
hypothesizes that an important class of codes can be
modeled as alternating sequences of scalar and
basic parallel operations and that these codes can
be meaningfully compared on vector and SIMD
machines. These basic sequences can be combined
in useful ways to model the operation of the code.
The basic sequences in turn are made of component
operations which can be combined to produce an
estimate of the throughput of a machine for the
sequence. Scalar sequences are assumed to consist of
unit operations, so that the throughput for a scalar
sequence is just the reciprocal of the number of scalar
operations times the scalar execution speed.

Vector  sequences  are  assumed  to  be  composed 
of  VECOPS.*  The  VECOPS  benchmarks  are  a  set 
of  vector  operations  which  are  frequently  used  by 
physical  scientists  in  their  work.  VECOPS  can  be 
combined  using  the  equations  developed  below  to 
produce  an  estimate  of  the  throughput  for  the  vector 
sequence.  Accordingly,  two  equations  have  been 
generated  to  predict  the  throughput  of  SIMD  and 
vector machines. For SIMD machines, let v be the
vector length, n the number of processors, r the
quoted (maximum) rate for the arithmetic operation
and t the resultant throughput; then


(1) 


On a vector machine the throughput rises asymptotically
to the maximum rate very quickly. On a
SIMD machine, the throughput rises linearly until
the size of the machine (i.e. the number of PEs)
is exceeded, then falls to the average rate (about half
the maximum), reflecting the average of the full vector
and the nearly empty one. It rises linearly again until
the array size is exceeded again and then falls to 2/3,
etc. (As SIMD machines have a relatively slow cycle
time, the throughput is low when the machine is
partially loaded.)
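
The sawtooth behaviour described above can be sketched numerically. Equation (1) is not legible in this copy, so the form used below, t = r v / (n ceil(v/n)), is an assumption inferred from the prose (a linear rise up to n, a drop to about one half just past n, and to about two thirds just past 2n). The function name and the sample sizes are illustrative only.

```python
def simd_throughput(v, n, r):
    """Assumed SIMD model: a vector of length v needs ceil(v / n) passes
    over an array of n processing elements, and every pass costs the same
    time whether the array is full or nearly empty."""
    if v <= 0 or n <= 0:
        raise ValueError("v and n must be positive integers")
    passes = (v + n - 1) // n  # integer ceiling of v / n
    return r * v / (n * passes)


if __name__ == "__main__":
    n, r = 1024, 1.0  # hypothetical array size and normalized peak rate
    for v in (512, 1024, 1025, 2048, 2049):
        print(v, round(simd_throughput(v, n, r), 3))
```

Under this assumed form, a vector one element longer than the array pays for two full passes, which is why the throughput roughly halves at that point.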

The above equations answer the question: given
a data set with vectors of a specific size, which is
the better machine for execution? Given that a code
can be modeled by an alternating sequence of
strings of scalar instructions followed by strings
of parallel instructions, it can be represented by
the following sequence:

(5)

where $w_i$ is a weight representing the number of
operations in each list, $s_i$ is the quoted throughput
for scalar operations and $p_i$ is the calculated throughput
for the parallel sequence. Then the throughput
for the entire sequence can be calculated from the
formula:


If a compound VECOP operation is being performed
(i.e. a vector add and multiply, or SAXPY), the
combined rate, $r_c$, can be calculated by the sum of
resistances formula. For example, for two operations
with rates $r_1$ and $r_2$, the combined throughput, $r_c$, is:

$$ \frac{1}{r_c} = \frac{1}{r_1} + \frac{1}{r_2} \quad \text{or} \quad r_c = \frac{r_1 r_2}{r_1 + r_2} \tag{2} $$

For n operations, this generalizes to:

$$ \frac{1}{r_c} = \sum_{i=1}^{n} \frac{1}{r_i} \quad \text{or} \quad r_c = \frac{\prod_{i=1}^{n} r_i}{\sum_{i=1}^{n} \prod_{j \neq i} r_j} \tag{3} $$

The $r_c$ calculated above can be used in the formula for
SIMD processors to determine the throughput for
the combined VECOP operation.
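
As a minimal sketch of the sum of resistances rule, the functions below combine per-operation rates in the same way parallel resistances are combined. The function names and the sample rates are illustrative assumptions, not part of the original benchmarks.

```python
def combined_rate(rates):
    """Combine per-operation rates like resistances in parallel:
    1 / r_c = sum(1 / r_i)."""
    rates = list(rates)
    if not rates or any(r <= 0 for r in rates):
        raise ValueError("need at least one positive rate")
    return 1.0 / sum(1.0 / r for r in rates)


def combined_rate_pair(r1, r2):
    """Two-operation special case: r_c = r1 * r2 / (r1 + r2)."""
    return r1 * r2 / (r1 + r2)


if __name__ == "__main__":
    # Hypothetical rates for an add and a multiply, in arbitrary units.
    print(combined_rate([40.0, 40.0]))
    print(combined_rate_pair(40.0, 40.0))
```

Two operations that each run at the same rate combine to half that rate, which matches the intuition that the compound operation does twice the work per element.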

The peak throughput often quoted for vector machines
is calculated by multiplying the basic cycle
rate by the number of arithmetic units in a processor.
However, it is not always possible to make full use of
all the units. For example, if a processor has two
units, a vector multiply can only be executed at
half the quoted rate because only one of the units is
effectively used. On the other hand, a vector multiply
and add will execute at the full rate. Another factor
is pipeline setup time. This factor must be applied
on every reload of the vector registers. For vector
machines, let v be the vector length, r the quoted
(maximum) rate, u the number of units per processor,
the number of units used by the operation, p the
pipeline setup time, t the resultant throughput rate,
and l the length of the vector registers; then
If no MIMD parallelism is present, this reduces
to:

The basic tenets of this model were tested using
the CONVEX and DAP computers in the NOSC
Superconcurrency Laboratory. The CONVEX
C-210 has a quoted peak rate of 50 mflops and
the DAP 510C has a peak rate of 140 mflops for
1024 PEs.
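
To make the sequence model concrete, the sketch below combines alternating scalar and parallel segments under the assumption that the segments run strictly one after another, so their times add and the overall throughput is the weighted harmonic mean of the segment rates. This is one illustrative reading of the serial case and is not the paper's own formula, which is not legible in this copy; the weights and rates in the usage lines are made up.

```python
def sequence_throughput(segments):
    """segments: iterable of (w, rate) pairs, where w is the number of
    operations in a segment and rate is its throughput (a scalar rate for
    scalar segments, a calculated parallel rate for parallel ones).
    Assumes the segments execute serially with no overlap."""
    segments = list(segments)
    if not segments or any(w <= 0 or rate <= 0 for w, rate in segments):
        raise ValueError("need positive weights and rates")
    total_ops = sum(w for w, _ in segments)
    total_time = sum(w / rate for w, rate in segments)
    return total_ops / total_time


if __name__ == "__main__":
    # Hypothetical code: 100 scalar operations at rate 1.0 followed by
    # 900 parallel operations at rate 9.0 (arbitrary units).
    print(sequence_throughput([(100, 1.0), (900, 9.0)]))
```

In this toy case the scalar segment holds only a tenth of the operations but takes half of the total time, so the overall throughput lands well below the parallel rate.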

4.1.2. Data profiling. An important component
of code execution modeling is data profiling. In
a general purpose heterogeneous environment,
where many machines can perform the same
task, such as FFTs, convolution or Gaussian elimination,
the question is which machine can do the
best job (i.e. execute the fastest) on the specific data
set.
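
The selection step can be sketched as a lookup over per-machine throughput predictors. Everything in the example is hypothetical: the function name, the machine labels and the two toy predictor curves are placeholders chosen only to show that the preferred machine can change with the size of the data.

```python
def pick_machine(predictors, vector_length):
    """Return the name of the machine whose predictor reports the highest
    throughput for the given vector length. predictors maps a machine
    name to a function of the vector length."""
    if not predictors:
        raise ValueError("no machines to choose from")
    return max(predictors, key=lambda name: predictors[name](vector_length))


# Toy predictors, purely illustrative (not measured on any real machine).
toy_predictors = {
    # Saturating curve: approaches a peak of 1.0 as v grows.
    "vector": lambda v: 1.0 * v / (v + 20),
    # Sawtooth curve: peak of 3.0 only when a 1024-element array is full.
    "simd": lambda v: 3.0 * v / (1024 * ((v + 1023) // 1024)),
}

if __name__ == "__main__":
    for v in (64, 1024):
        print(v, pick_machine(toy_predictors, v))
```

With these made-up curves, the short vector goes to the "vector" entry and the array-sized vector goes to the "simd" entry, which is the kind of data-dependent decision the profiling step is meant to automate.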

As an example of data profiling in a HAsP environment,
consider the matrix multiply. Let A be an I × J
matrix, $(a_{ij})$, and B a J × K matrix, $(b_{jk})$. The product C
is an I × K matrix, $(c_{ik})$, i.e.

$$ c_{ik} = \sum_{j=1}^{J} a_{ij} b_{jk} $$

In order to compute the I K terms, $(c_{ik})$, we would
need a triple loop of the form (assuming appropriate
initialization):

for l_1 = 1 to L_1
    for l_2 = 1 to L_2
        for l_3 = 1 to L_3
            c_ik = c_ik + a_ij b_jk
        end
    end
end.

The Ls are, of course, the ranges, I, J and K,
with the ls being their corresponding indices, i, j
and k. Mathematically, it does not matter which
of the six possible ways these are computed. However,
Dongarra et al. have clearly demonstrated that
these arrangements can have significantly different
compute times when computed on a global memory,
vector machine. Since there are several factors that
enter into these performance differences, it is very
much the concern of the programmer to render
the order of the loops optimal depending on the
specific circumstances. In the HAsP language, this
is not necessary, since the data profiling does this
automatically.
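
The claim that all six loop orderings give the same product can be checked with a short sketch. The function name and the small test matrices are illustrative assumptions; plain Python lists are used so that the loop nesting stays explicit, which also means the sketch says nothing about relative speed.

```python
from itertools import permutations


def matmul_loop_order(A, B, order):
    """Multiply A (I x J) by B (J x K) with the three loops nested in the
    given order, outermost first, e.g. "ijk" or "kji"."""
    if sorted(order) != ["i", "j", "k"]:
        raise ValueError("order must be a permutation of 'ijk'")
    ranges = {"i": len(A), "j": len(B), "k": len(B[0])}
    C = [[0] * ranges["k"] for _ in range(ranges["i"])]
    outer, middle, inner = order
    for x in range(ranges[outer]):
        for y in range(ranges[middle]):
            for z in range(ranges[inner]):
                idx = {outer: x, middle: y, inner: z}
                i, j, k = idx["i"], idx["j"], idx["k"]
                C[i][k] += A[i][j] * B[j][k]
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    for p in permutations("ijk"):
        print("".join(p), matmul_loop_order(A, B, "".join(p)))
```

Only the order in which the partial products are accumulated changes between orderings; the set of products contributing to each entry of C is the same in every case.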

To understand how this is done, let us look again
at the schematic code above. In the two (of the six)
cases in which J (and its associated index, j) forms the
inner loop, $c_{ik}$ is a scalar (constant) for the
purposes of the inner loop and this is called an SDOT
type of operation; the two vectors (the $a_{ij}$ and the $b_{jk}$)
are multiplied component-wise and then each component
product is added to the constant $c_{ik}$.
In the other cases, i.e. where J is not the innermost
range, we have the SAXPY family of operations, in
which either $a_{ij}$ or $b_{jk}$ is a scalar for the
purposes of the inner loop. In these cases, we have a
scalar (either $a_{ij}$ or $b_{jk}$) multiplied by each component
of the other element, a vector in the inner loop, and
then each component product is added, component-wise,
to the vector $c_{ik}$. As mentioned above,
a number of (occasionally conflicting) factors determine
the right order to perform the loops. For
example, it is generally better (in FORTRAN) to
have the innermost loop on the leftmost index so as
to avoid non-unit stride through memory and the
increased likelihood of bank conflict this brings. It is
also usually more efficient to have the innermost loop
on the longest range. SAXPY is generally better for
short and medium length vectors whereas SDOT is
better for longer vectors (at least one reason for this
is that SDOT requires summing up the product
terms, which usually requires a scalar loop; however,
for long vectors this is dwarfed by the reduced
operation count of SDOT). It should be noted
that for a SIMD machine, these factors are largely
irrelevant. In each of the six cases we would store the
components of the vectors (as dictated by the inner loop)
at each node and broadcast the scalar value. From
the point of view of a SIMD machine, it matters not
whether the multiplier is the scalar broadcast value
(SAXPY) or not (SDOT). All six cases are essentially
the same.
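
The two inner-loop families can be sketched side by side. The helper names below are illustrative, and the kernels are written in plain Python rather than calling any BLAS library; the point is only to show which quantity is the scalar and which are the vectors in each form.

```python
from itertools import permutations


def inner_kernel(order):
    """Classify a loop ordering (outermost first): the summation index j
    innermost gives a dot product (SDOT); otherwise the inner loop is a
    scalar-times-vector-plus-vector update (SAXPY)."""
    return "SDOT" if order[-1] == "j" else "SAXPY"


def dot(x, y):
    """Sum of component-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def axpy(alpha, x, y):
    """Return alpha * x + y, component-wise."""
    return [alpha * a + b for a, b in zip(x, y)]


def matmul_sdot(A, B):
    """Ordering i, k, j: each c_ik is one dot product of a row of A with
    a column of B."""
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]


def matmul_saxpy(A, B):
    """Ordering i, j, k: the scalar a_ij scales row j of B and the result
    is accumulated into row i of C."""
    C = []
    for row in A:
        acc = [0] * len(B[0])
        for a_ij, b_row in zip(row, B):
            acc = axpy(a_ij, b_row, acc)
        C.append(acc)
    return C


if __name__ == "__main__":
    for p in permutations("ijk"):
        print("".join(p), inner_kernel("".join(p)))
```

Two of the six orderings classify as SDOT and four as SAXPY, in line with the count given in the text.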

5.  CONCLUSION

The HAsP approach outlined here is intended
to provide a flexible, comprehensive paradigm for
computation on heterogeneous systems of supercomputers.
The paradigm is applicable to all levels of
computing from single heterogeneous computers
to homogeneous local area networks on up to large
nationwide heterogeneous networks. The system is
designed  to  minimize  data  movement  and  communi¬ 
cation  overhead  and  therefore  maximize  throughput 
and  execution  speed.  The  initial  formulas  for  code 
execution  modeling  have  been  developed  and  verified. 
The  concept  that  data  as  well  as  code  must  be  profiled 
was  developed  and  verified  by  experiment.  The  next 
aim  is  to  develop  a  prototype  HAsP  system  with 
two  or  three  levels  of  VHAM  in  a  heterogeneous 
environment. 

Acknowledgements — This work is sponsored by the Office
of Naval Technology, the U.S. Navy-ASEE Summer
Fellowship Program, Kent State University and the Naval
Command, Control and Ocean Surveillance Center.